Realtime Speech-to-Text API
Real-time speech-to-text streaming over a single WebSocket connection.
This endpoint is a dedicated transcription stream: you push raw audio frames in and receive incremental transcriptions and, optionally, translations back as JSON.
Step 1. Get API credentials
Get your Client ID and Client Secret from the
Palabra API keys section.
Step 2. Create a session
Exchange your credentials for a short-lived token by calling
POST /session-storage/session. The publisher field in the response is the
token you pass when connecting.
import requests

def get_token(client_id: str, client_secret: str) -> str:
    resp = requests.post(
        "https://api.palabra.ai/session-storage/session",
        json={"data": {}},
        headers={"ClientId": client_id, "ClientSecret": client_secret},
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]
Step 3. Connect
Open a WebSocket to the endpoint below. The token from Step 2 must be
passed as the token query parameter.
All other stream settings are passed as query parameters in the same URL.
wss://api.palabra.ai/asr/v1/speech-to-text/stream?token=<TOKEN>&language=en&format=pcm_s16le&sample_rate=16000
import websockets

# token is the value returned by get_token() in Step 2
url = (
    "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
    f"?token={token}&language=en&format=pcm_s16le&sample_rate=16000"
)
ws = await websockets.connect(url)  # must run inside an async function
Query parameters
| Parameter | Required | Description |
|---|---|---|
| token | yes | Session token from Step 2 |
| format | yes | Audio format (see Audio formats) |
| sample_rate | conditional | Sample rate in Hz. Required for raw PCM and G.711 formats, except pcm_s16le, where it may be omitted when the rate is 16000. Not used for container formats |
| language | no | Source language code. Defaults to auto |
| translate_languages | no | Comma-separated target languages, e.g. es,de,fr |
| enable_filler_filter | no | Whether to enable the filler filter. true by default for all languages except ja |
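The parameters above can be assembled into a connection URL with a small helper. This is a minimal sketch, not part of the API: the name build_stream_url is hypothetical, and encoding the boolean as the strings true / false is an assumption based on how the default is written in the table.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"


def build_stream_url(
    token: str,
    audio_format: str = "pcm_s16le",
    sample_rate: Optional[int] = None,
    language: Optional[str] = None,
    translate_languages: Optional[List[str]] = None,
    enable_filler_filter: Optional[bool] = None,
) -> str:
    """Assemble the stream URL, leaving out optional parameters that are unset."""
    params: Dict[str, str] = {"token": token, "format": audio_format}
    if sample_rate is not None:
        params["sample_rate"] = str(sample_rate)
    if language is not None:
        params["language"] = language
    if translate_languages:
        # Targets are sent as one comma-separated value, e.g. es,de,fr
        params["translate_languages"] = ",".join(translate_languages)
    if enable_filler_filter is not None:
        # Assumption: the server accepts the literal strings "true" / "false"
        params["enable_filler_filter"] = "true" if enable_filler_filter else "false"
    # safe="," keeps the comma-separated language list readable in the URL
    return f"{WS_URL}?{urlencode(params, safe=',')}"
```

Using urlencode rather than string formatting means a token containing reserved characters is escaped correctly.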
Step 4. Send audio
Send audio as raw binary WebSocket frames. Chunks of 320 ms are recommended.
await ws.send(data)
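To make the 320 ms recommendation concrete, the sketch below converts a duration into a byte count and slices a buffer accordingly. The helper names and the bytes-per-sample table are illustrative, and mono audio is assumed (the page does not state a channel count; the complete example further down records one channel).

```python
from typing import Iterator

# Bytes per sample for the raw formats listed under "Audio formats"
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_s32be": 4,
    "pcm_f32le": 4,
    "pcm_f32be": 4,
    "mulaw": 1,
    "alaw": 1,
}


def chunk_size_bytes(audio_format: str, sample_rate: int, chunk_ms: int = 320) -> int:
    """Size in bytes of one chunk of mono audio lasting chunk_ms milliseconds."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[audio_format]


def iter_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

For pcm_s16le at 16 kHz this gives 5120 samples, or 10240 bytes, per frame. Each yielded slice would then be passed to ws.send as one binary frame.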
Audio formats
| format | sample_rate | Notes |
|---|---|---|
| pcm_s16le | only if ≠ 16000 | 16-bit signed little-endian PCM. Recommended |
| pcm_f32le / pcm_f32be | required | 32-bit float PCM |
| pcm_s32le / pcm_s32be | required | 32-bit signed PCM |
| mulaw / alaw | required | G.711 |
| webm / mp3 / aac / ogg / flac / wav | not used | Container formats; rate is read from the stream |
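The sample_rate rules in this table can be checked on the client before connecting. The function below is a hypothetical pre-flight check that mirrors the table; it is not something the API provides, and the server remains the authority on what it accepts.

```python
from typing import Optional

RATE_REQUIRED = {"pcm_f32le", "pcm_f32be", "pcm_s32le", "pcm_s32be", "mulaw", "alaw"}
CONTAINERS = {"webm", "mp3", "aac", "ogg", "flac", "wav"}


def sample_rate_to_send(audio_format: str, sample_rate: Optional[int]) -> Optional[int]:
    """Return the sample_rate value to put in the URL, or None to omit it."""
    if audio_format == "pcm_s16le":
        # Optional: may be left out when the audio is 16000 Hz
        return sample_rate
    if audio_format in RATE_REQUIRED:
        if sample_rate is None:
            raise ValueError(f"sample_rate is required for {audio_format}")
        return sample_rate
    if audio_format in CONTAINERS:
        # The rate is read from the stream, so the parameter is not used
        return None
    raise ValueError(f"unknown format: {audio_format}")
```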
Step 5. Receive messages
All server-to-client messages are JSON text frames. Switch on message_type.
transcription
Emitted continuously as speech is recognized.
{
"message_type": "transcription",
"transcription_id": "a1b2c3d4",
"language": "en",
"is_eos": false,
"segment": {
"text": "Hello world how are",
"start_time": 0.32,
"end_time": 1.84
},
"delta": {
"text": "how are",
"start_time": 1.20,
"end_time": 1.84
}
}
| Field | Description |
|---|---|
| transcription_id | Stable id for the segment. All messages of one segment share the same id. A new id means a new segment has started |
| language | Detected (or configured) source language of this segment |
| is_eos | false: the segment is partial and still being updated. true: the segment is committed and final |
| segment.text | The full text of the segment so far |
| segment.start_time / end_time | Segment timing, in seconds relative to session start |
| delta | Incremental hint: the text added since the previous partial of the same segment (see below) |
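One way to consume these fields is to keep the latest text per transcription_id and move a segment to a committed list once is_eos is true. The Transcript class below is a hypothetical client-side sketch that relies only on the fields in the table above.

```python
import json
from typing import Dict, List, Tuple


class Transcript:
    """Tracks in-progress and committed segments from transcription messages."""

    def __init__(self) -> None:
        self.partial: Dict[str, str] = {}        # transcription_id -> latest text
        self.final: List[Tuple[str, str]] = []   # committed (id, text), in order

    def handle(self, raw: str) -> bool:
        """Apply one JSON text frame. Returns False for other message types."""
        msg = json.loads(raw)
        if msg.get("message_type") != "transcription":
            return False
        tid = msg["transcription_id"]
        text = msg["segment"]["text"]
        if msg["is_eos"]:
            # Committed: drop the partial entry and record the final text
            self.partial.pop(tid, None)
            self.final.append((tid, text))
        else:
            # Still updating: overwrite with the full text so far
            self.partial[tid] = text
        return True
```

Overwriting with segment.text on every partial keeps the state correct regardless of whether the filler filter is enabled.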
Working with delta
When the filler filter is disabled, delta.text is append-only:
each transcription message carries exactly the text appended
since the previous partial, so you can concatenate deltas directly.
With the filler filter enabled, the recognizer's tail might be rewritten mid-segment, which breaks the append relationship. In that mode treat
segment.text as authoritative and overwrite the current segment on each message; use delta only
as a hint.
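A client that wants to append deltas when possible, but stay correct when the tail is rewritten, can verify each delta against segment.text. The sketch below assumes deltas are joined to the previous text with a single space, which matches the sample message above but is not stated by the API; update_segment is a hypothetical helper.

```python
from typing import Tuple


def update_segment(shown: str, msg: dict, joiner: str = " ") -> Tuple[str, bool]:
    """Return (new_text, appended).

    appended is True when the previously shown text plus the delta reproduces
    segment.text, so a UI can append the delta in place. When it is False the
    segment was rewritten and the whole segment should be redrawn.
    """
    full = msg["segment"]["text"]
    delta = (msg.get("delta") or {}).get("text", "")
    # Assumption: a single-space joiner between the old text and the delta
    candidate = shown + joiner + delta if shown and delta else shown + delta
    return full, candidate == full
```

Either way the returned text is segment.text, so the displayed segment never drifts from what the server reports.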
translated_transcription
Sent only when translate_languages is set, once per target language, after
each final (is_eos: true) transcription.
{
"message_type": "translated_transcription",
"transcription_id": "a1b2c3d4",
"language": "es",
"is_eos": true,
"segment": {
"text": "Hola mundo, ¿cómo estás?",
"start_time": 0.32,
"end_time": 1.84
}
}
transcription_id matches the id of the source transcription (the
is_eos: true one) this translation was produced from — use it to correlate a
translation back to its original segment. language here is the target
language, and is_eos is always true (translations are produced only for
finalized segments).
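Correlating by transcription_id can be done with two dictionaries keyed on the id. The TranslationIndex class below is a hypothetical sketch; it takes already-parsed messages and uses only the fields documented above.

```python
from typing import Dict, Iterable


class TranslationIndex:
    """Pairs final source segments with their translations by transcription_id."""

    def __init__(self) -> None:
        self.sources: Dict[str, str] = {}                  # id -> source text
        self.translations: Dict[str, Dict[str, str]] = {}  # id -> {lang: text}

    def handle(self, msg: dict) -> None:
        tid = msg.get("transcription_id")
        kind = msg.get("message_type")
        if kind == "transcription" and msg.get("is_eos"):
            # Only final segments are translated, so only those are indexed
            self.sources[tid] = msg["segment"]["text"]
        elif kind == "translated_transcription":
            # language is the target language of this translation
            self.translations.setdefault(tid, {})[msg["language"]] = msg["segment"]["text"]

    def is_complete(self, tid: str, targets: Iterable[str]) -> bool:
        """True once the source and every requested target language have arrived."""
        got = self.translations.get(tid, {})
        return tid in self.sources and all(lang in got for lang in targets)
```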
Errors
Authentication and routing failures are reported as HTTP status codes during the WebSocket upgrade, before the connection is established:
| HTTP status | Meaning |
|---|---|
| 401 | Missing or invalid token |
| 409 | A session is already active for this identity |
After a successful upgrade, the server does not send application-level error messages over the wire — it closes the connection with a standard WebSocket close frame.
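Because these failures arrive as HTTP status codes rather than JSON messages, a client may want to map them to readable errors. The helper below is a hypothetical sketch covering only the two statuses listed above. How the status code is exposed when the upgrade is rejected depends on the WebSocket client library and its version, so that part is left out.

```python
# Statuses documented for the WebSocket upgrade
UPGRADE_ERRORS = {
    401: "missing or invalid token",
    409: "a session is already active for this identity",
}


def explain_upgrade_failure(status: int) -> str:
    """Turn an HTTP status from a rejected upgrade into a readable message."""
    reason = UPGRADE_ERRORS.get(status)
    if reason is None:
        # Not listed in the documentation; report it without guessing a cause
        return f"unexpected HTTP {status} during WebSocket upgrade"
    return f"HTTP {status}: {reason}"
```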
Complete example
Streams microphone audio and prints transcriptions. Translations are also
printed when they arrive, which requires adding translate_languages to the
connection URL; PALABRA_LANGUAGE sets only the source language.
pip install pyaudio websockets requests
export PALABRA_CLIENT_ID=... # from Step 1
export PALABRA_CLIENT_SECRET=...
export PALABRA_LANGUAGE=en # source language
import asyncio
import json
import os
import queue
import threading

import pyaudio
import requests
import websockets

WS_URL = "wss://api.palabra.ai/asr/v1/speech-to-text/stream"
SESSION_URL = "https://api.palabra.ai/session-storage/session"
LANGUAGE = os.environ.get("PALABRA_LANGUAGE", "en")
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK = 5120  # samples ≈ 320 ms at 16 kHz (recommended chunk size)


def get_token() -> str:
    resp = requests.post(
        SESSION_URL,
        json={"data": {
            "subscriber_count": 0,
            "publisher_count": 1,
            "publisher_can_subscribe": True,
        }},
        headers={
            "ClientId": os.environ["PALABRA_CLIENT_ID"],
            "ClientSecret": os.environ["PALABRA_CLIENT_SECRET"],
        },
        timeout=10,  # do not hang forever if the API is unreachable
    )
    resp.raise_for_status()
    return resp.json()["data"]["publisher"]


def mic_reader(audio_queue: queue.Queue, stop_event: threading.Event):
    # Runs in a background thread: PyAudio reads are blocking
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    print("Microphone open, speak now...")
    try:
        while not stop_event.is_set():
            audio_queue.put(stream.read(CHUNK, exception_on_overflow=False))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()


async def stream(token: str):
    url = (
        f"{WS_URL}?token={token}&language={LANGUAGE}"
        f"&format=pcm_s16le&sample_rate={SAMPLE_RATE}"
    )
    audio_queue: queue.Queue = queue.Queue()
    stop_event = threading.Event()
    threading.Thread(
        target=mic_reader, args=(audio_queue, stop_event), daemon=True
    ).start()

    async with websockets.connect(url) as ws:
        print("Connected")

        async def send_audio():
            loop = asyncio.get_running_loop()
            while True:
                # queue.Queue.get blocks, so run it off the event loop
                data = await loop.run_in_executor(None, audio_queue.get)
                await ws.send(data)  # raw binary frame

        async def receive():
            async for message in ws:
                msg = json.loads(message)
                msg_type = msg.get("message_type")
                if msg_type == "transcription":
                    text = msg["segment"]["text"]
                    tid = msg.get("transcription_id", "")
                    if msg.get("is_eos"):
                        print(f"\n[EOS] {text}  [{tid}]")
                    else:
                        # segment.text is the source of truth; render it whole
                        print(f"\r  {text}", end="", flush=True)
                elif msg_type == "translated_transcription":
                    lang = msg.get("language", "?")
                    tid = msg.get("transcription_id", "")
                    print(f"\n[{lang}] {msg['segment']['text']}  [{tid}]")

        try:
            await asyncio.gather(send_audio(), receive())
        finally:
            stop_event.set()  # tell the microphone thread to exit


if __name__ == "__main__":
    token = get_token()
    print("Session created")
    try:
        asyncio.run(stream(token))
    except KeyboardInterrupt:
        print("\nStopped.")